ABSTRACT

The Mantel-Haenszel procedure (MH) is a practical,
inexpensive, and powerful way to detect test items that function
differently in two groups of examinees. MH is a natural outgrowth of
previously suggested chi-square methods, and it is also related to
methods based on item response theory. The study of items that
function differently for two groups of examinees, originally called
item bias research, focuses on the fact that different groups of
examinees may react differently to the same test question. The
preferred, more neutral, term is differential item functioning (DIF).
In studying DIF, members of the focal group and the reference group
should be comparable. The item data may be arranged into 2x2
tables. Chi-square procedures test a hypothesis, but do not produce a
parametric measure of the amount of DIF exhibited by the studied
item. The MH procedure provides a measure of the size of the
departure of the data from the null hypothesis. The data are
presented in 2x2 tables, and the measure of DIF is in the scale of
differences in item difficulty as measured in the Educational Testing
Service delta scale. MH is considerably less expensive to use than
item response analyses. (GDC)
1. INTRODUCTION AND NOTATION

Holland (1985) proposed the use of the Mantel-Haenszel procedure as a
practical and powerful way to detect test items that function differently in two
groups of examinees. In this paper we show how this use of the Mantel-Haenszel
(MH) procedure is a natural outgrowth of the previously suggested chi-square
procedures of Scheuneman (1979), Marascuilo and Slaughter (1981), Mellenberg
(1982), and others, and we show how the MH procedure relates to methods based on
item response theory (Lord, 1980).

The study of items that function differently for two groups of examinees 
has a long history. Originally called "item bias" research, modern approaches 
focus on the fact that different groups of examinees may react differently to 
the same test question. These differences are worth exploring since they may 
shed light both on the test question and on the experiences and backgrounds of 
the different groups of examinees. We prefer the more neutral terms,
differential item performance or differential item functioning (i.e., dif), to
item bias, since in many examples of items that exhibit dif the term "bias" does
not accurately describe the situation.

Early work at ETS on dif began with Cardall and Coffman (1964) and Angoff 
and Ford (1973). The book by Berk (1982) summarizes research to 1980. 

The following notational scheme and terminology are used in the rest of this
paper. We will always be comparing two groups of examinees, of which the
performance of one, the focal group, F, is of primary interest. The performance
of the other group, the reference group, R, is taken as a standard against which
we will compare the performance of the focal group. For example, the focal group
might be all black examinees while the reference group might consist of the
white examinees. Typically, all test items in a given testing instrument will
be analysed for evidence of dif, and this will be done one item at a time. We
will refer to the item that is being examined for evidence of dif in a given
analysis as the studied item.

Basic to all modern approaches to the study of dif is the notion of
comparing only comparable members of F and R in attempting to identify items that
exhibit dif. Comparability means identity in those measured characteristics in
which examinees may differ and that are strongly related to performance on the
studied item. Important among the criteria used to define comparability are (a)
measures of the ability for which the item is designed, (b) schooling or other
measures of relevant experience, and (c) membership in other groups. In
practice, the matching criteria will usually include test scores since these are
available, accurately measured, and usually measure the same ability as the
studied item.

If both examinee ability and item characteristics are confounded by simply
measuring the difference in the performance on an item between unmatched
reference and focal group members, the result is a measure of impact rather than
of differential item performance. For example, comparing the proportion of
reference and focal group members who give correct answers to a given item is a
measure of the item's impact on the focal group relative to the reference group.
In this paper we do not discuss impact, since the confounding of differences in
examinee ability with characteristics of items is of little utility in
attempting to identify items that may truly disadvantage some subpopulations of
examinees.

Suppose that criteria for matching have been selected. Then the data for
the studied item for the examinees in R and F may be arranged into a series of
2x2 tables, one such table for each matched set of reference and focal group
members. The data for the performance of the $j$th matched set on the studied
item are displayed below.

                  Score on Studied Item
    Group          1           0         Total
    R            $A_j$       $B_j$     $n_{Rj}$
    F            $C_j$       $D_j$     $n_{Fj}$
    Total       $m_{1j}$    $m_{0j}$     $T_j$

Table 1: Data for the $j$th matched set of members of R and F.

In Table 1, $T_j$ is the total number of reference and focal group members in
the $j$th matched set; $n_{Rj}$ is the number of these who are in R; and of
these $A_j$ answered the studied item correctly. The other entries in Table 1
have similar definitions.

In order to state statistical hypotheses precisely, it is necessary to have
a sampling model for the data in Table 1. It is customary to act as though the
values of the marginal totals, $n_{Rj}$ and $n_{Fj}$, are fixed and to regard
the data for R and F as having arisen as random samples of size $n_{Rj}$ and
$n_{Fj}$ from large matched pools of reference and focal group members. It
follows that $A_j$ and $C_j$ are independent binomial variates with parameters
$(n_{Rj}, p_{Rj})$ and $(n_{Fj}, p_{Fj})$, respectively. These population values
can be arranged as a 2x2 table that is parallel to Table 1, i.e.,

                  Score on Studied Item
    Group          1           0         Total
    R           $p_{Rj}$    $q_{Rj}$       1
    F           $p_{Fj}$    $q_{Fj}$       1

Table 2: Population parameters for data from the $j$th matched set.
The hypothesis of no dif corresponds to the null hypothesis

$$H_0: \; p_{Rj} = p_{Fj} \quad \text{for all } j.$$

The hypothesis, $H_0$, is also the hypothesis of conditional independence of
group membership and the score on the studied item given the matching variable
(Bishop, Fienberg, and Holland, 1975).

Under $H_0$, the "expected values" for the cell entries of Table 1 are well
known to be obtained by the "product of margins over total" rule and are
summarized below.

$$E(A_j) = n_{Rj}\, m_{1j} / T_j, \qquad E(C_j) = n_{Fj}\, m_{1j} / T_j,$$

$$E(B_j) = n_{Rj}\, m_{0j} / T_j, \qquad E(D_j) = n_{Fj}\, m_{0j} / T_j. \tag{1}$$
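The "product of margins over total" rule in (1) can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts`, the dictionary layout, and the counts in the example are our own hypothetical choices, not part of any existing program.

```python
def expected_counts(n_r, n_f, m1, m0):
    """Expected cell counts of one 2x2 table under H0, following eq. (1)."""
    total = n_r + n_f  # T_j
    if total != m1 + m0:
        raise ValueError("row margins and column margins must sum to the same T_j")
    return {
        "A": n_r * m1 / total,  # reference group, correct
        "B": n_r * m0 / total,  # reference group, incorrect
        "C": n_f * m1 / total,  # focal group, correct
        "D": n_f * m0 / total,  # focal group, incorrect
    }


# Hypothetical matched set: 60 reference and 40 focal examinees,
# of whom 70 answered the studied item correctly and 30 did not.
example = expected_counts(n_r=60, n_f=40, m1=70, m0=30)
print(example)
```

The four expected values always add up to $T_j$, and each row and column of expected values reproduces the corresponding observed margin.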

2. PREVIOUS CHI-SQUARE PROCEDURES 

Scheuneman (1979) proposed a procedure to test the hypothesis, $H_0$,
utilizing a specific type of matching criterion. Let S denote a score on a
criterion test, e.g., an operational test score that may or may not include the
studied item. The values of S are categorized into a few intervals;
Scheuneman suggests that three to five intervals are satisfactory. The matched
groups are defined by the categorized values of S so that members of R and F are
considered matched if their scores on S fall into the same score interval. In
terms of the notation of section 1, the test statistic proposed by Scheuneman is
given by

$$\text{SCHEUN} = \sum_{j=1}^{K} \left[ \frac{(A_j - E(A_j))^2}{E(A_j)} + \frac{(C_j - E(C_j))^2}{E(C_j)} \right], \tag{2}$$

which is algebraically equal to

$$\text{SCHEUN} = \sum_{j=1}^{K} \frac{(A_j - E(A_j))^2}{n_{Rj}\, n_{Fj}\, m_{1j} / T_j^2},$$

because $C_j - E(C_j) = -(A_j - E(A_j))$ when the margins of the table are
fixed.
It was originally thought that SCHEUN had an approximate chi-square
distribution on K-1 degrees of freedom when $H_0$ is true (Scheuneman, 1979).
This is not correct, as discussed in Barker (1981) and Scheuneman (1981). For
example, under $H_0$, the expectation of SCHEUN, conditional on the four
marginal values $n_{Rj}$, $n_{Fj}$, $m_{1j}$, $m_{0j}$ in each 2x2 table, is
given by

$$E(\text{SCHEUN}) = \sum_{j=1}^{K} \frac{m_{0j}}{T_j - 1}. \tag{3}$$

This value is sensitive to the total number of incorrect responses in each 2x2
table and can range from 0 up to K. If SCHEUN had an approximate chi-square
distribution on K-1 degrees of freedom then the expected value in (3) would be
approximately K-1 for any set of values of $m_{0j}$. Fortunately, a small
correction to (2) does give the resulting statistic an approximate chi-square
distribution under $H_0$. The corrected statistic is



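$$\text{CHISQ-P} = \sum_{j=1}^{K} \frac{T_j}{m_{0j}} \left[ \frac{(A_j - E(A_j))^2}{E(A_j)} + \frac{(C_j - E(C_j))^2}{E(C_j)} \right], \tag{4}$$

which can be shown to be algebraically identical to

$$\text{CHISQ-P} = \sum_{j=1}^{K} \frac{(A_j - E(A_j))^2}{n_{Rj}\, n_{Fj}\, m_{0j}\, m_{1j} / T_j^3}. \tag{5}$$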



This is well known to be the Pearson chi-square test statistic for testing $H_0$,
and the proper degrees of freedom equals the number of matched groups, K, if the
$T_j$ are all large (Bishop, Fienberg, and Holland, 1975). It is also called the
"full" chi-square by some to distinguish it from SCHEUN.
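The relation between SCHEUN in (2) and CHISQ-P in (4) can be checked numerically. The sketch below assumes each matched set is stored as a tuple `(A, B, C, D)` in the layout of Table 1; the function name and the two example tables are hypothetical, and every table is assumed to have nonzero margins.

```python
def scheun_and_chisq_p(tables):
    """Return (SCHEUN, CHISQ-P) for a list of 2x2 tables given as (A, B, C, D)."""
    scheun = 0.0
    chisq_p = 0.0
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d      # row totals n_Rj, n_Fj
        m1, m0 = a + c, b + d        # column totals m_1j, m_0j
        t = n_r + n_f                # T_j
        e_a = n_r * m1 / t           # E(A_j) from eq. (1)
        e_c = n_f * m1 / t           # E(C_j) from eq. (1)
        term = (a - e_a) ** 2 / e_a + (c - e_c) ** 2 / e_c
        scheun += term               # summand of eq. (2)
        chisq_p += (t / m0) * term   # summand of eq. (4)
    return scheun, chisq_p


# Two hypothetical matched sets.
tables = [(30, 10, 20, 20), (15, 5, 12, 8)]
scheun, chisq_p = scheun_and_chisq_p(tables)
print(scheun, chisq_p)
```

Because each summand of (4) is the corresponding summand of (2) multiplied by $T_j / m_{0j} \geq 1$, SCHEUN can never exceed CHISQ-P.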

The K 2x2 tables may be regarded as a single 2x2xK table and the standard
theory of log-linear models for three-way tables may be used to test $H_0$. This
leads to the suggestion of Marascuilo and Slaughter (1981) and Mellenberg (1982)
to use the likelihood ratio chi-square statistics to test $H_0$ instead of (5).

The alternative hypothesis against which $H_0$ is tested by CHISQ-P (and its
likelihood-ratio versions) is simply the negation of $H_0$, i.e.,

$$\bar{H}_0: \; p_{Rj} \neq p_{Fj} \quad \text{for some } j.$$

This is why CHISQ-P is a multi-degree-of-freedom chi-square test. It is not
powerful against specific alternatives to $H_0$, but it will detect any such
departure if the $T_j$ are large enough. This fact leads to a trade-off between
bias and statistical power that is not well made, in our opinion, by procedures
like Scheuneman's or "methods 1, 2, 4, 5, and 6" of Marascuilo and Slaughter
(1981). The trade-off arises from the desire to increase the values of $T_j$ in
order to increase the power of the test. This degrades the quality of the
matching (i.e., lumps together examinees whose scores are not equal) in order to
increase the sample sizes in the matched groups, i.e., $T_j$. This is necessary
in these procedures because of the goal of being able to detect any type of
departure from $H_0$. An alternative approach, and one that we favor, is to
reduce the types of alternatives to $H_0$ against which the test has good power
and to concentrate this power into a few degrees of freedom that actually occur
in test data. This occurs in Method 3 of Marascuilo and Slaughter (1981).
Mellenberg (1982) has moved in this direction by distinguishing "uniform" from
"non-uniform bias." The MH procedure does this by concentrating on Mellenberg's
uniform bias and yet it does not degrade the quality of the matching. We will
discuss this in the next section.

A separate problem with the chi-square procedures is that they are only
tests of $H_0$ and do not produce a parametric measure of the amount of dif
exhibited by the studied item. As is well known, tests will always reject the
null hypothesis provided that the relevant sample sizes are large enough. It is
more informative to have a measure of the size of the departure of the data from
$H_0$. The MH procedure provides such a measure.
3. THE MANTEL-HAENSZEL PROCEDURE 

In their seminal paper, Mantel and Haenszel (1959) introduced a new
procedure for the study of matched groups. The data are in the form of K 2x2
tables as in Table 1. They developed a chi-square test of $H_0$ against the
specific alternative hypothesis

$$H_1: \; \frac{p_{Rj}}{q_{Rj}} = \alpha\, \frac{p_{Fj}}{q_{Fj}}, \quad j = 1, \ldots, K, \tag{6}$$

for $\alpha \neq 1$. Note that $\alpha = 1$ corresponds to $H_0$, which can also
be expressed as

$$H_0: \; \frac{p_{Rj}}{q_{Rj}} = \frac{p_{Fj}}{q_{Fj}}, \quad j = 1, \ldots, K. \tag{7}$$

The parameter $\alpha$ is called the common odds-ratio in the K 2x2 tables
because under $H_1$ the value of $\alpha$ is the odds ratio

$$\alpha = \frac{p_{Rj}}{q_{Rj}} \Big/ \frac{p_{Fj}}{q_{Fj}} = \frac{p_{Rj}\, q_{Fj}}{p_{Fj}\, q_{Rj}} \quad \text{for all } j = 1, \ldots, K. \tag{8}$$

The Mantel-Haenszel chi-square test statistic is based on

$$\frac{\left( \sum_j A_j - \sum_j E(A_j) \right)^2}{\sum_j \text{Var}(A_j)}, \tag{9}$$

where $E(A_j)$ is defined in (1) and

$$\text{Var}(A_j) = \frac{n_{Rj}\, n_{Fj}\, m_{1j}\, m_{0j}}{T_j^2\, (T_j - 1)}. \tag{10}$$

The statistic in (9) is usually given a continuity correction to improve the
accuracy of the chi-square percentage points as approximations to the observed
significance levels. This has the form

$$\text{MH-CHISQ} = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \tfrac{1}{2} \right)^2}{\sum_j \text{Var}(A_j)}. \tag{11}$$

It may be shown, for example in Birch (1964) or Cox (1970), that a test based
on MH-CHISQ is the uniformly most powerful unbiased test of $H_0$ versus $H_1$.
Hence no other test can have higher power somewhere in $H_1$ than the one based
on MH-CHISQ unless the other test violates the size constraint on the null
hypothesis or has lower power than the test's size somewhere else in $H_1$.
Under $H_0$, MH-CHISQ has an approximate chi-square distribution with one degree
of freedom. It corresponds to the single-degree-of-freedom chi-square test given
by Mellenberg (1982) for testing no "bias" against the hypothesis of "uniform
bias." It is not identical to the test proposed by Mellenberg but in many
practical situations they give virtually identical results even though
Mellenberg's proposal involves an iterative log-linear model fitting process.
The MH procedure is not iterative.
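Since the MH procedure is not iterative, the statistics in (9) and (11) reduce to three running sums. The following is a minimal sketch under our own assumptions: tables are tuples `(A, B, C, D)` as in Table 1, the function name `mh_chisq` is hypothetical, and (11) is applied literally, so no special handling is added for the case where the absolute deviation is smaller than 1/2.

```python
def mh_chisq(tables, continuity=True):
    """Mantel-Haenszel chi-square from a list of 2x2 tables (A, B, C, D).

    With continuity=True this follows eq. (11); with continuity=False, eq. (9).
    """
    sum_a = sum_e = sum_var = 0.0
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d
        m1, m0 = a + c, b + d
        t = n_r + n_f
        if t < 2:
            continue  # Var(A_j) in eq. (10) is undefined for T_j < 2
        sum_a += a
        sum_e += n_r * m1 / t                               # E(A_j), eq. (1)
        sum_var += n_r * n_f * m1 * m0 / (t * t * (t - 1))  # Var(A_j), eq. (10)
    deviation = abs(sum_a - sum_e)
    if continuity:
        deviation -= 0.5
    return deviation ** 2 / sum_var


# Two hypothetical matched sets.
tables = [(30, 10, 20, 20), (15, 5, 12, 8)]
print(mh_chisq(tables), mh_chisq(tables, continuity=False))
```

The result would be referred to a chi-square table with one degree of freedom, regardless of the number of matched sets.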

Mantel and Haenszel also provide an estimate of $\alpha$, the common odds-ratio
across the 2x2 tables. Their estimator is given by

$$\hat{\alpha}_{MH} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}. \tag{12}$$

The odds-ratio is on the scale of 0 to $\infty$ with $\alpha = 1$ playing the
role of a null value of no dif. It is convenient to take logs of
$\hat{\alpha}_{MH}$ to put it into a symmetric scale in which 0 is the null
value. Thus we have proposed that

$$\hat{\Delta}_{MH} = -\frac{4}{1.7} \ln(\hat{\alpha}_{MH}) = -2.35 \ln(\hat{\alpha}_{MH}) \tag{13}$$

be used as a measure of the amount of dif. $\hat{\Delta}_{MH}$ has the
interpretation of being a measure of dif in the scale of differences in item
difficulty as measured in the ETS "delta scale" (Holland and Thayer, 1985).
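The estimator of the common odds-ratio and its transformation to the delta scale can be sketched as follows. The function names and example tables are hypothetical, tables are tuples `(A, B, C, D)` as in Table 1, and the sketch assumes that at least one table has $B_j C_j > 0$ so that the denominator is nonzero.

```python
import math


def mh_odds_ratio(tables):
    """Mantel-Haenszel estimate of the common odds-ratio, eq. (12)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


def mh_delta(tables):
    """Log odds-ratio rescaled to the delta metric, eq. (13)."""
    return -(4.0 / 1.7) * math.log(mh_odds_ratio(tables))


# Two hypothetical matched sets in which the reference group does better.
tables = [(30, 10, 20, 20), (15, 5, 12, 8)]
print(mh_odds_ratio(tables), mh_delta(tables))
```

In this example the odds-ratio estimate exceeds 1, so the delta-scale value is negative, which is the direction associated with an item that favors the reference group.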

When using $\hat{\alpha}_{MH}$ or $\hat{\Delta}_{MH}$ it is useful to have a
simple interpretation of these values. The value of $\hat{\alpha}_{MH}$ is the
average factor by which the odds that a member of R is correct on the studied
item exceeds the corresponding odds for a comparable member of F. Values of
$\hat{\alpha}_{MH}$ that exceed 1 correspond to items on which the reference
group performed better on average than did comparable members of the focal
group. The value of $\hat{\Delta}_{MH}$ is the average amount more difficult
that a member of R found the studied item than did comparable members of F.
Values of $\hat{\Delta}_{MH}$ that are negative correspond to items that the
reference group found easier on average than did comparable focal group members.
The parameters, $\alpha$ and $\ln(\alpha)$, are also called "partial
association" parameters because they are analogous to the partial correlations
used with continuous data. The matching variable is "partialled out" of the
association between group membership and performance on the studied item (Birch,
1964).

Mantel and Haenszel proposed both the test statistic MH-CHISQ and the
parameter estimate $\hat{\alpha}_{MH}$. Since that initial work many authors
have contributed to the study of these procedures; the main results are as
follows.
(a) The effect of the continuity correction is to improve the calculation of
the observed significance levels using the chi-square table rather than to
make the size of the test equal to the nominal value. Hence simulation
studies routinely find that the actual size of a test based on MH-CHISQ is
smaller than the nominal value. However, the observed significance level
of a large value of MH-CHISQ is better approximated by referring MH-CHISQ
to the chi-square tables than by referring the expression in (9) to these
tables. The continuity correction simply improves the approximation
of a discrete distribution (i.e., that of MH-CHISQ) by a continuous
distribution (i.e., the one-degree-of-freedom chi-square).

(b) $\hat{\alpha}_{MH}$ is a consistent estimator of the $\alpha$ in (8) and the
variability of $\hat{\alpha}_{MH}$ is nearly optimal over the range
$1/3 < \alpha < 3$, which translates into $-2.6 < \Delta < 2.6$ under the log
transformation in (13). Outside this range $\hat{\alpha}_{MH}$ and
$\hat{\Delta}_{MH}$ are still reasonably efficient, but very large (or small)
values of $\alpha$ are not as accurately estimated by $\hat{\alpha}_{MH}$ as
they are by maximum likelihood. Since larger values of $\alpha$ are easy to
detect using MH-CHISQ, this is not an important limitation.

(c) Standard error formulas for $\hat{\alpha}_{MH}$ and $\hat{\Delta}_{MH}$ that
work in a variety of circumstances have taken a long time to develop. Important
contributions have been Hauck (1979), Breslow and Liang (1982), and Flanders
(1985). Recent joint work with A. Phillips suggests that the following
approximate variance formula for $\ln(\hat{\alpha}_{MH})$ is valid whenever the
numerator and denominator of $\hat{\alpha}_{MH}$ are both large:

$$\text{Var}(\ln(\hat{\alpha}_{MH})) = \frac{1}{2U^2} \sum_j \left[ T_j^{-2} \left( A_j D_j + \hat{\alpha}_{MH} B_j C_j \right) \left( A_j + D_j + \hat{\alpha}_{MH} (B_j + C_j) \right) \right], \tag{14}$$

where

$$U = \sum_j A_j D_j / T_j.$$

This approximate variance formula agrees with well-known variance estimates
for $\ln(\hat{\alpha}_{MH})$ in the few cases in which these are available. It
is discussed more extensively in Phillips and Holland (1986).
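Formula (14) is straightforward to evaluate once the tables are available. The sketch below is an illustration under our own assumptions: tables are tuples `(A, B, C, D)`, the function names are hypothetical, and the standard error on the delta scale is obtained simply by multiplying the standard error of the log odds-ratio by the constant 4/1.7 from (13).

```python
import math


def var_log_alpha_mh(tables):
    """Approximate variance of ln(alpha_MH), following eq. (14)."""
    u = sum(a * d / (a + b + c + d) for a, b, c, d in tables)        # U
    s = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    alpha = u / s                                                    # eq. (12)
    total = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        total += (a * d + alpha * b * c) * (a + d + alpha * (b + c)) / t ** 2
    return total / (2.0 * u ** 2)


def se_delta_mh(tables):
    """Standard error of the delta-scale measure implied by eq. (13)."""
    return (4.0 / 1.7) * math.sqrt(var_log_alpha_mh(tables))


# Two hypothetical matched sets.
tables = [(30, 10, 20, 20), (15, 5, 12, 8)]
print(var_log_alpha_mh(tables), se_delta_mh(tables))
```

For a single 2x2 table the expression reduces to $1/A + 1/B + 1/C + 1/D$, the familiar large-sample variance of a log odds-ratio, which is one of the cases in which a well-known estimate is available for comparison.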

It is sometimes helpful to show how $\hat{\alpha}_{MH}$ is expressed as a
weighted average of the sample cross-product ratios in each of the K 2x2 tables.
These are the values

$$\hat{\alpha}_j = \frac{A_j D_j}{B_j C_j}. \tag{15}$$

Hence

$$\hat{\alpha}_{MH} = \frac{\sum_j w_j\, \hat{\alpha}_j}{\sum_j w_j}, \tag{16}$$

where

$$w_j = B_j C_j / T_j.$$

In their discussion of chi-square techniques, Marascuilo and Slaughter consider
Cochran's (1954) test. In this test, instead of using the odds-ratio in each
table as a measure of dif in the $j$th matched group, the difference in
proportions is used, i.e.,

$$\frac{A_j}{n_{Rj}} - \frac{C_j}{n_{Fj}}. \tag{17}$$

These are averaged together with the weights

$$n_{Rj}\, n_{Fj} / T_j$$

to get an overall average difference across all matched groups. More recently,
Dorans and Kulick (1986) have suggested applying the weights $n_{Fj}$ to the
difference in (17) to get an overall standardized measure of dif for the item.
Dorans and Kulick do not develop a test based on their measures, but it is
evident that such a test, similar to Cochran's test, could be developed. Since
Dorans and Kulick are primarily interested in a good descriptive measure of dif,
their choice of weights does not correspond to a statistically optimal test of
$H_0$.
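The two weighting schemes applied to the difference in (17) can be compared with a short sketch. Everything named here is hypothetical: tables are tuples `(A, B, C, D)` as in Table 1, the function `weighted_p_difference` takes the weight as a function of the two group sizes, and the example counts are invented.

```python
def weighted_p_difference(tables, weight):
    """Weighted average over matched sets of A_j/n_Rj - C_j/n_Fj, eq. (17)."""
    numerator = denominator = 0.0
    for a, b, c, d in tables:
        n_r, n_f = a + b, c + d
        difference = a / n_r - c / n_f   # eq. (17)
        w = weight(n_r, n_f)
        numerator += w * difference
        denominator += w
    return numerator / denominator


def cochran_weight(n_r, n_f):
    """Weights n_Rj * n_Fj / T_j used with Cochran's test."""
    return n_r * n_f / (n_r + n_f)


def focal_weight(n_r, n_f):
    """Weights n_Fj, as in the standardization approach."""
    return n_f


# Two hypothetical matched sets with unequal group sizes in the second one.
tables = [(30, 10, 20, 20), (15, 5, 6, 4)]
print(weighted_p_difference(tables, cochran_weight))
print(weighted_p_difference(tables, focal_weight))
```

The two averages coincide whenever the ratio of reference to focal group sizes is the same in every matched set, and differ otherwise.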

In summary, the Mantel-Haenszel procedure is a natural extension of the
ideas behind the chi-square procedures of Scheuneman and others. It provides a
single-degree-of-freedom chi-square test that is powerful against realistic
alternatives to $H_0$, it allows detailed and careful matching on relevant
criteria, and it provides a single summary measure of the magnitude of the
departure from $H_0$ exhibited by the studied item.
4. THE MH PROCEDURE AND IRT MODELS 

It is generally believed that there is, at best, only a rough correspondence 
between the "chi-square" types of procedures for studying dif and the more 
"theoretically preferred" methods based on item response theory (IRT). For 
examples of this view see Scheuneman (1979), Marascuilo and Slaughter (1981) and 
Shepard, Camilli, and Williams (1984). In this section we show that the MH
procedure highlights a close connection between these two important classes of
procedures. Our observations on this point are strongly influenced by the work
of our colleague, Paul Rosenbaum; see Rosenbaum (1985, 1986).

We adopt the notation and terminology for discussing IRT models given in
Holland (1981), Cressie and Holland (1983), Rosenbaum (1984), and Holland and
Rosenbaum (1986). Thus $x_k$ is the 0/1 indicator of a correct response to item
k, k = 1, ..., J, and $x = (x_1, \ldots, x_J)$ denotes a generic response
vector; there are $2^J$ possible values of x. In any population of examinees we
let p(x) denote the proportion of them who would produce the response vector x
if tested. Then

$$p(x) \geq 0, \qquad \sum_x p(x) = 1.$$

An IRT model assumes that the value of p(x) is specified by an equation of
the form

$$p(x) = \int \prod_{k=1}^{J} P_k(\theta)^{x_k}\, Q_k(\theta)^{1 - x_k} \, dG(\theta). \tag{18}$$

In (18), $P_k(\theta) = 1 - Q_k(\theta)$ is the item characteristic curve (ICC)
for item k and $G(\theta)$ is the distribution function of the latent ability,
$\theta$, across the population of examinees.

It is customary to restrict the ICCs and $\theta$ in various ways. For example,
$\theta$ is usually a scalar (not a vector) and the $P_k$ are assumed to be
monotone increasing functions of $\theta$. Holland and Rosenbaum (1986) point
out that without some restriction of this type IRT models are vacuous.
Parametric assumptions such as the 1-, 2-, or 3-parameter logistic form for
$P_k(\theta)$ may also be imposed.

If there are two subpopulations of examinees, R and F, then there are
corresponding values $p_R(x)$ and $p_F(x)$. In general, each subpopulation will
have its own ICCs, i.e.,

$$P_{kR}(\theta) \text{ and } P_{kF}(\theta), \quad k = 1, \ldots, J,$$

as well as its own ability distribution,

$$G_R(\theta) \text{ and } G_F(\theta).$$

Lord (1980) states the hypothesis of no dif in terms of an IRT model. For
item k it is

$$H_0(\text{IRT}): \; P_{kR}(\theta) = P_{kF}(\theta) = P_k(\theta) \quad \text{for all } \theta.$$

Thus, if $H_0(\text{IRT})$ holds for all k then $p_R(x)$ and $p_F(x)$ have the
representations

$$p_R(x) = \int \prod_{k=1}^{J} P_k(\theta)^{x_k}\, Q_k(\theta)^{1 - x_k} \, dG_R(\theta), \qquad p_F(x) = \int \prod_{k=1}^{J} P_k(\theta)^{x_k}\, Q_k(\theta)^{1 - x_k} \, dG_F(\theta). \tag{19}$$

Rosenbaum (1985) considers tests of the hypothesis that a representation like
(19) exists for $p_R(x)$ and $p_F(x)$ in which R has a "higher" distribution of
$\theta$ than does F.

The integrals in (18) and (19) are not easy to work with except in one
special case, i.e., the Rasch model. For this model $P_k(\theta)$ has the
logistic form

$$P_k(\theta) = e^{\theta - b_k} / (1 + e^{\theta - b_k}). \tag{20}$$

If (20) is inserted into (18) then Cressie and Holland (1983) show that
p(x) may be expressed as

$$p(x) = p(0) \left[ \prod_{k=1}^{J} f_k^{x_k} \right] \mu(x_+), \tag{21}$$

where

$$f_k = e^{-b_k}, \tag{22}$$

$$x_+ = \sum_k x_k,$$

and

$$\mu(t) = E(U^t), \quad t = 0, 1, \ldots, J. \tag{23}$$

In (23), U is a positive random variable whose distribution depends on the ICCs
and on the ability distribution $G(\theta)$. Hence if we apply (21) to $p_R(x)$
and $p_F(x)$ without assuming $H_0(\text{IRT})$ we get

$$p_R(x) = p_R(0) \left[ \prod_{k=1}^{J} f_{kR}^{x_k} \right] \mu_R(x_+) \tag{24}$$

and

$$p_F(x) = p_F(0) \left[ \prod_{k=1}^{J} f_{kF}^{x_k} \right] \mu_F(x_+). \tag{25}$$

Now suppose that we wish to apply the MH procedure in this situation and
that we take as the matching variable the total score on the test, $X_+$. If
item 1 is the studied item then the relevant population probabilities for R are
of the form

$$p_{Rj} = P(X_1 = 1 \mid X_+ = j, R).$$

Using (24) this can be expressed as

$$p_{Rj} = \frac{P(X_1 = 1, X_+ = j \mid R)}{P(X_+ = j \mid R)} = \frac{f_{1R}\, S_{J-1,j-1}(f_R^*)}{S_{J,j}(f_R)}, \tag{26}$$

where

$$f_R = (f_{1R}, \ldots, f_{JR}) = (f_{1R}, f_R^*)$$

and

$$S_{J,j}(f) = \sum_{x:\, x_+ = j} \; \prod_{k=1}^{J} f_k^{x_k}$$

(i.e., the elementary symmetric function of J variables of degree j). The
factors $p_R(0)$ and $\mu_R(j)$ cancel between the numerator and the denominator
of (26). Similarly,

$$q_{Rj} = \frac{S_{J-1,j}(f_R^*)}{S_{J,j}(f_R)}. \tag{27}$$

Hence the odds for success on item 1 in R in the $j$th matched set are

$$\frac{p_{Rj}}{q_{Rj}} = f_{1R}\, \frac{S_{J-1,j-1}(f_R^*)}{S_{J-1,j}(f_R^*)}. \tag{28}$$

Similar equations hold for $p_{Fj}$ and $q_{Fj}$ and the corresponding odds are

$$\frac{p_{Fj}}{q_{Fj}} = f_{1F}\, \frac{S_{J-1,j-1}(f_F^*)}{S_{J-1,j}(f_F^*)}. \tag{29}$$

Now suppose that for items 2 through J there is no dif, i.e.,

$$f_{kF} = f_{kR}, \quad k = 2, \ldots, J,$$

so that $f_F^* = f_R^*$.

Then the population odds-ratio in each 2x2 table is

$$\frac{p_{Rj}}{q_{Rj}} \Big/ \frac{p_{Fj}}{q_{Fj}} = \frac{f_{1R}}{f_{1F}}. \tag{30}$$



Equation (30) is a statement of $H_1$ in (6) with
$\alpha = f_{1R}/f_{1F} = e^{b_{1F} - b_{1R}}$, so that for
the Rasch model the hypothesis for which the MH procedure was developed holds
exactly in the population under the following conditions.

(a) Items 2, 3, ..., J exhibit no dif, but the studied item may
exhibit dif,

(b) the criterion for matching, $X_+$, includes the studied item,

(c) the data are random samples from R and F.

This result is a little surprising since the inclusion of the studied item
in the criterion seems to go against the traditional uses of the MH procedures
in medical applications. However, it can be shown that if the studied item is
excluded from the criterion then the null hypothesis $H_0$ is not satisfied even
though $H_0(\text{IRT})$ is satisfied for every item.

For example, when the criterion for matching is $X_* = \sum_{k \geq 2} X_k$ and
item 1 is the studied item, the relevant population probabilities are

$$p_{Rj} = P(X_1 = 1 \mid X_* = j, R)$$

and

$$p_{Fj} = P(X_1 = 1 \mid X_* = j, F).$$

It is easy to show that the equations that correspond to (28) and (29) are

$$\frac{p_{Rj}}{q_{Rj}} = f_{1R}\, \frac{\mu_R(j+1)}{\mu_R(j)} \tag{31}$$

and

$$\frac{p_{Fj}}{q_{Fj}} = f_{1F}\, \frac{\mu_F(j+1)}{\mu_F(j)}. \tag{32}$$

Hence the odds-ratio in the $j$th matched set is

$$\alpha_j = \frac{f_{1R}}{f_{1F}} \left[ \frac{\mu_R(j+1)}{\mu_R(j)} \Big/ \frac{\mu_F(j+1)}{\mu_F(j)} \right]. \tag{33}$$

Thus the $\alpha_j$ are not constant across the 2x2 tables. The ratio of moments
in (33) is related to order relationships between the distributions of the
random variable U, from (23), in R and F. For example, if the distribution of U
for F is "lower" than that for R then we will have

$$\frac{\mu_R(j+1)}{\mu_R(j)} \Big/ \frac{\mu_F(j+1)}{\mu_F(j)} > 1 \quad \text{for all } j = 1, 2, \ldots$$
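The contrast between (30) and (33) can be checked by exact calculation under a Rasch model. The sketch below rests on several assumptions of our own: ability is given a discrete distribution (a list of `(theta, probability)` pairs) so that the integral in (18) becomes a finite sum, item 1 is the studied item, and all function names and parameter values are hypothetical.

```python
import math


def rest_score_dist(theta, b_rest):
    """P(X_* = s | theta) for the non-studied items, by dynamic programming."""
    dist = [1.0]
    for b in b_rest:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # Rasch ICC, eq. (20)
        new = [0.0] * (len(dist) + 1)
        for s, pr in enumerate(dist):
            new[s] += pr * (1.0 - p)
            new[s + 1] += pr * p
        dist = new
    return dist


def joint(b, ability):
    """joint[x1][s] = P(X_1 = x1, X_* = s) for one group; b[0] is the studied item."""
    out = [[0.0] * len(b) for _ in range(2)]
    for theta, g in ability:
        p1 = 1.0 / (1.0 + math.exp(-(theta - b[0])))
        for s, pr in enumerate(rest_score_dist(theta, b[1:])):
            out[1][s] += g * p1 * pr
            out[0][s] += g * (1.0 - p1) * pr
    return out


def odds_ratios(b_r, b_f, g_r, g_f, include_studied):
    """Population odds-ratio (R over F) in each matched set with both cells nonempty."""
    jr, jf = joint(b_r, g_r), joint(b_f, g_f)
    n_rest = len(b_r) - 1
    if include_studied:
        # Matched on X_+ = j: those correct on item 1 have rest score j - 1.
        return [(jr[1][j - 1] / jr[0][j]) / (jf[1][j - 1] / jf[0][j])
                for j in range(1, n_rest + 1)]
    # Matched on X_* = j, the score on the other items only.
    return [(jr[1][j] / jr[0][j]) / (jf[1][j] / jf[0][j])
            for j in range(n_rest + 1)]


# Hypothetical setup: only item 1 differs between groups; R has the "higher" ability.
b_r = [0.3, -1.0, 0.0, 1.0]
b_f = [0.8, -1.0, 0.0, 1.0]
g_r = [(-1.0, 0.2), (0.0, 0.3), (1.0, 0.5)]
g_f = [(-1.0, 0.5), (0.0, 0.3), (1.0, 0.2)]
included = odds_ratios(b_r, b_f, g_r, g_f, include_studied=True)
excluded_no_dif = odds_ratios(b_f, b_f, g_r, g_f, include_studied=False)
print(included)         # compare with exp(b_1F - b_1R) = exp(0.5)
print(excluded_no_dif)  # no item has dif, yet these are not all equal to 1
```

With the studied item included in the score, every matched set gives the same odds-ratio, as in (30). With it excluded, the odds-ratios differ from 1 even though no item exhibits dif, which is the behavior described by (33).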

This analysis raises the issue of whether the studied item should be
included or not in the matching criterion. If it is not included, then the MH
procedure will not behave correctly when there is no dif according to an IRT
model. However, the Rasch model analysis suggests that the inclusion of the
studied item in the matching criterion does not mask the existence of dif;
rather it is the inclusion of other items exhibiting dif in the criterion that
could lead to the finding that no dif exists for the studied item when in fact
it does. This idea leads to two steps.

Step 1: Purify the matching criterion by eliminating items based on a
preliminary dif or impact analysis (Kok et al., 1985, make a similar
suggestion).

Step 2: Use as the matching criterion the total score on all items left in
the purified criterion plus the studied item, even if it is then
omitted from the criterion of all other items when they are
studied in turn.

It is possible that we have drawn too heavily on the analysis of the Rasch
model and a good deal of simulation work may be necessary before we know for
sure if our suggestions hold in greater generality. We have begun some of that
work and will report on it later. However, to date the results of the
simulation study corroborate our proposal regarding the inclusion or exclusion
of the studied item in the criterion.
The issue of including and excluding an item from the criterion shows the
need for making these adjustments in the computational formulas for
$\hat{\alpha}_{MH}$ and MH-CHISQ. These are as follows.

If the K 2x2 tables have been assembled for a number-right score S as the
matching criterion that does not include the studied item and we wish to include
it in the score, then the 2x2 tables need to be altered to these (the matched
sets being indexed by the score value j).

                  Score on Studied Item
    Group          1             0          Total
    R           $A_{j-1}$      $B_j$      $n'_{Rj}$
    F           $C_{j-1}$      $D_j$      $n'_{Fj}$
    Total      $m_{1,j-1}$    $m_{0j}$      $T'_j$

The values of MH-CHISQ and $\hat{\alpha}_{MH}$ are then computed from these
tables.

Similarly, if S contains the score of the studied item and we wish to
eliminate it, this is done by using these 2x2 tables.

                  Score on Studied Item
    Group          1             0          Total
    R           $A_{j+1}$      $B_j$      $n'_{Rj}$
    F           $C_{j+1}$      $D_j$      $n'_{Fj}$
    Total      $m_{1,j+1}$    $m_{0j}$      $T'_j$

Thus it is a simple matter to compute either $\hat{\alpha}_{MH}$ or MH-CHISQ
including or excluding the studied item from a number-right-score matching
criterion. If the matching criterion is a formula score or a grouped
number-right score then it is not easy to adjust for the inclusion of the
studied item into the criterion without recalculating the entire set of 2x2
tables.
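For a number-right score the two adjustments amount to shifting the "correct" column of the tables by one score point. The sketch below assumes the tables are held in a list indexed by score value 0, 1, 2, ..., each entry a tuple `(A, B, C, D)` as in Table 1; the function names and the example counts are hypothetical.

```python
def include_studied_item(tables):
    """Re-index tables built on a score that excludes the studied item.

    An examinee with old score j - 1 who answered the studied item correctly
    moves to new score j; an examinee who answered incorrectly stays at j.
    """
    n = len(tables)
    out = []
    for j in range(n + 1):
        a = tables[j - 1][0] if j >= 1 else 0   # A_{j-1}
        c = tables[j - 1][2] if j >= 1 else 0   # C_{j-1}
        b = tables[j][1] if j < n else 0        # B_j
        d = tables[j][3] if j < n else 0        # D_j
        out.append((a, b, c, d))
    return out


def exclude_studied_item(tables):
    """Re-index tables built on a score that includes the studied item.

    An examinee with old score j + 1 who answered the studied item correctly
    moves to new score j; an examinee who answered incorrectly stays at j.
    """
    out = []
    for j in range(len(tables) - 1):
        a, c = tables[j + 1][0], tables[j + 1][2]   # A_{j+1}, C_{j+1}
        b, d = tables[j][1], tables[j][3]           # B_j, D_j
        out.append((a, b, c, d))
    return out


# Hypothetical tables for scores 0, 1, 2 on a criterion without the studied item.
tables = [(2, 8, 1, 9), (5, 5, 3, 7), (9, 1, 6, 4)]
shifted = include_studied_item(tables)
print(shifted)
print(exclude_studied_item(shifted) == tables)
```

The adjusted tables can then be passed to whatever routine computes MH-CHISQ and the odds-ratio estimate. The exclusion step relies on the fact that, when the score includes the studied item, nobody at score 0 answered it correctly and nobody at the maximum score answered it incorrectly.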
5. DISCUSSION 

There are many procedures that have been proposed for the study of dif over
the last twenty years, and the introduction of a new one, associated with names
that are unfamiliar to psychometricians, is likely to be regarded skeptically.
However, we have tried to show that the MH procedure, drawn from the field of
biostatistics, fits squarely into the network of ideas developed by previous
workers in the field of "item bias". In addition, standard statistical
concepts, such as tests, hypotheses, errors of type I and II, estimates, and
standard errors all fit neatly into the package. Connections between chi-square
methods and IRT based methods are made evident by studying the Mantel-Haenszel
procedure.

We believe that the view that IRT based approaches to dif are
"theoretically preferred" over chi-square based procedures is not a very precise
way of describing the situation. It is certainly true that likelihood ratio
tests of $H_0(\text{IRT})$ in the context of specific parametric IRT models
(i.e., 3PL ICCs and Normal $\theta$-distributions) are statistically optimal (or
very nearly so) in the sense of power and efficiency when these models actually
hold. If the data really are generated by such models, as they would be in a
simulation, then no other test of the equality of two ICCs for the same item, at
the given significance level, can have larger power than these likelihood ratio
tests. However, it is only the procedures based on marginal maximum likelihood,
as advocated by Bock and Aitkin (1981), that can yield true likelihood ratio
tests (e.g., see Thissen, Wainer, and Steinberg, 1985). IRT-based procedures
that depend on multiple LOGIST calibrations do not automatically result in tests
of $H_0(\text{IRT})$ or estimates of ICC differences that are optimal.
Furthermore, even the marginal maximum likelihood procedures are not optimal
when the assumed model is wrong.

In our view, parametric IRT models provide an important testing ground for
evaluating dif procedures. Under $H_0(\text{IRT})$, test statistics ought to
achieve significance levels that are close to the nominal values regardless of
the choices of $G_R$, $G_F$, and the ICCs, $P_k(\theta)$. Against alternatives
to $H_0(\text{IRT})$, likelihood ratio procedures will set the upper bounds on
the power and efficiency of any test procedure, including the LOGIST-based
procedures or chi-square procedures like the Mantel-Haenszel. Our use of a
specific IRT model (the Rasch model) to evaluate the Mantel-Haenszel procedure
resulted in a new conception of the importance of including or excluding the
studied item in the criterion. This shows the advantage of a theoretical
analysis. We were led to that analysis by the empirical finding that including
the studied item in the test score used as a matching criterion had a measurable
and consistent effect on the values of $\hat{\alpha}_{MH}$ and
$\hat{\Delta}_{MH}$ computed in real data. The $\hat{\Delta}_{MH}$ values
shifted by an amount that was nearly independent of the studied item but which
did depend on the overall differences in performance on the criterion test
between R and F. The bigger the difference the bigger the shift. This is exactly
what is predicted by equation (33) when there are large differences between the
$\theta$-distributions, $G_F$ and $G_R$.

Our conjecture is that it is correct to include the studied item in the
matching criterion when it is being analysed for dif, but if it has substantial
dif then that item should be excluded from the criterion used to match examinees
for any other studied item. The first "inclusion" is to control the size of the
test given by MH-CHISQ while the second "exclusion" is to prevent large dif
items from degrading the power of this test. Such an approach is independent of
the MH procedure and can be incorporated into other chi-square techniques, or
into the iterative logit technique discussed by Kok et al. (1985).

A final note on costs. The MH procedure is very inexpensive to use
compared to IRT analyses. For example, runs that involve 50 items and 2500
examinees cost about $10 on a typical mainframe computer. Our main reason for
pursuing this approach has been to provide ETS with a practical and yet
powerful tool for the study of dif that incorporates all of the advances in
methodology that have occurred since the late 1970s.